[megatron, model] feat: qwen3.5 example #5381
[megatron, model] feat: qwen3.5 example #5381ISEEKYAN wants to merge 8 commits intoverl-project:mainfrom
Conversation
There was a problem hiding this comment.
Code Review
This pull request adds support for Qwen3.5 SFT with Megatron. The changes are mostly workarounds and fixes to support the Qwen3.5 architecture, particularly its Gated Delta Net (GDN) and chat template requirements. The changes look reasonable and well-commented, improving compatibility and robustness. I have one major concern about catching a broad Exception which could hide bugs.
| return_tensors="pt", | ||
| **apply_chat_template_kwargs, | ||
| ) | ||
| except (jinja2.exceptions.TemplateError, Exception) as e: |
There was a problem hiding this comment.
Catching a generic Exception is risky as it can suppress unexpected errors, making debugging difficult. It's better to catch more specific exceptions. Since jinja2.exceptions.TemplateError is a subclass of Exception, the tuple (jinja2.exceptions.TemplateError, Exception) is redundant and equivalent to except Exception:. Please replace Exception with the specific exception type(s) that are expected to contain the 'No user query' message. If the exact type is unknown, consider catching a narrower set of exceptions like ValueError or TypeError which are common for such issues.
|
@ISEEKYAN does this pr can also support rl? |
just updated a script with RL supports. But it is not easy to prepare a right vllm dependency now🥲 |
Many thanks. the vllm qwen3.5 during initialization, need to be fixed."so what issue is there with vllm qwen3.5 initialization? I see in vllm doc that vllm can indeed serve qwen3.5(https://docs.vllm.ai/projects/recipes/en/latest/Qwen/Qwen3.5.html). |
|
Successfully ran Qwen3.5 SFT (verl megatron example) with the following setup: (1) mbridge: install from source for qwen3_5 support — (2) megatron-core == 0.16.0 — required for attention_output_gate and other GDN options. (3) verl patch in verl/models/mcore/patch.py: applies the gate-slicing fix when Key library versions used:
|
| # Qwen3.5 uses Gated Delta Net (GDN) linear attention which currently does | ||
| # NOT support packed sequences (THD format) in Megatron-LM. Therefore: | ||
| # - actor.megatron.use_remove_padding=False (forces bshd compute format) | ||
| # - model.use_remove_padding=True (keeps NestedTensor in data pipeline) |
There was a problem hiding this comment.
For new model engine, I think we always use NestedTensor regardless of model.use_remove_padding?
| # Try the fast path first: direct unbind works for some NestedTensor | ||
| # layouts where the batch dim is not entangled with the ragged dim. | ||
| try: | ||
| tensors = nt.unbind(dim=0) |
There was a problem hiding this comment.
In which case nested tensor unbind failed? I didn't expect that unbind may failed.
There was a problem hiding this comment.
3D jagged tensors (e.g., MRoPE position_ids)
|
It seems that the vllm nightly version which support Qwen3.5 require transformers==4.57.6, which conflicts with the latest transformers version 5.2.0 |
you could try with sglang |
try first install vllm==0.15.0 then install transformers==5.2.0 |
|
Successfully ran Qwen3.5 MoE GRPO (this PR example) with the following setup: however, when i ran Qwen3.5-27B, error raise: File "/root/verl/verl/workers/megatron_workers.py", line 887, in compute_log_prob
output, entropys, layers_topk_idx = self.actor.compute_log_prob(data=data, calculate_entropy=not is_lora)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/verl/verl/utils/profiler/performance.py", line 105, in f
return self.log(decorated_function, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/verl/verl/utils/profiler/performance.py", line 118, in log
output = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/root/verl/verl/workers/actor/megatron_actor.py", line 254, in compute_log_prob
output = self.forward_backward_batch(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/verl/verl/workers/actor/megatron_actor.py", line 716, in forward_backward_batch
losses_reduced = forward_backward_func(
^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/megatron/core/pipeline_parallel/schedules.py", line 636, in forward_backward_no_pipelining
output_tensor, num_tokens = forward_step(
^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/megatron/core/pipeline_parallel/schedules.py", line 423, in forward_step
output_tensor, loss_func = forward_step_func(data_iterator, model)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/verl/verl/workers/actor/megatron_actor.py", line 666, in forward_step
output = forward_fn(
^^^^^^^^^^^
File "/root/verl/verl/models/mcore/model_forward.py", line 141, in model_forward
output_orig = model(
^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/megatron/core/distributed/data_parallel_base.py", line 22, in forward
return self.module(*inputs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/megatron/core/transformer/module.py", line 489, in forward
outputs = self.module(*inputs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/mbridge/models/qwen3_5/model.py", line 367, in forward
output = self.language_model(
^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/megatron/core/models/gpt/gpt_model.py", line 504, in forward
preproc_output = self._preprocess(
^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/megatron/core/models/gpt/gpt_model.py", line 388, in _preprocess
rotary_pos_emb = self.rotary_pos_emb(
^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/mbridge/models/qwen3_vl/rope_utils.py", line 130, in forward
seq_expanded = seq[:, :, None, :].float()
~~~^^^^^^^^^^^^^^^
IndexError: too many indices for tensor of dimension 2@ISEEKYAN is this PR support Qwen3.5-27B? or any example for Qwen3.5-27B? |
@zyfzjsc988 I only did exp on moe version, let me check if there is any bug with dense version. cc @LiuXTao |
@zyfzjsc988 Hi, thanks for sharing your setup and experience with the Qwen3.5 MoE GRPO example! I tried to reproduce your environment and also installed transformer-engine, but I'm still encountering an error: I suspect that vllm==0.15.0 might not be the correct version to support this model architecture. Shall we install the nightly version or if there are any additional steps needed to make it work? Any guidance would be greatly appreciated. Thanks in advance! |
hi, @ISEEKYAN i fix this bug and run class SupportedVLM(Enum):
QWEN2_5_VL = "Qwen2_5_VLForConditionalGeneration"
QWEN3_MOE_VL = "Qwen3VLMoeForConditionalGeneration"
QWEN3_VL = "Qwen3VLForConditionalGeneration"
QWEN3_5_MOE_VL = "Qwen3_5MoeForConditionalGeneration"
QWEN3_5_VL = "Qwen3_5ForConditionalGeneration"but File "/root/verl/verl/workers/megatron_workers.py", line 887, in compute_log_prob
output, entropys, layers_topk_idx = self.actor.compute_log_prob(data=data, calculate_entropy=not is_lora)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/verl/verl/utils/profiler/performance.py", line 105, in f
return self.log(decorated_function, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/verl/verl/utils/profiler/performance.py", line 118, in log
output = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/root/verl/verl/workers/actor/megatron_actor.py", line 254, in compute_log_prob
output = self.forward_backward_batch(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/verl/verl/workers/actor/megatron_actor.py", line 716, in forward_backward_batch
losses_reduced = forward_backward_func(
^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/megatron/core/pipeline_parallel/schedules.py", line 636, in forward_backward_no_pipelining
output_tensor, num_tokens = forward_step(
^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/megatron/core/pipeline_parallel/schedules.py", line 423, in forward_step
output_tensor, loss_func = forward_step_func(data_iterator, model)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/verl/verl/workers/actor/megatron_actor.py", line 666, in forward_step
output = forward_fn(
^^^^^^^^^^^
File "/root/verl/verl/models/mcore/model_forward.py", line 141, in model_forward
output_orig = model(
^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/megatron/core/distributed/data_parallel_base.py", line 22, in forward
return self.module(*inputs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/megatron/core/transformer/module.py", line 489, in forward
outputs = self.module(*inputs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/mbridge/models/qwen3_5/model.py", line 367, in forward
output = self.language_model(
^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/megatron/core/models/gpt/gpt_model.py", line 525, in forward
hidden_states = self.decoder(
^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/megatron/core/transformer/transformer_block.py", line 619, in __call__
return super().__call__(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/megatron/core/transformer/module.py", line 352, in __call__
return super().__call__(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/megatron/core/transformer/transformer_block.py", line 765, in forward
hidden_states, context = layer(
^^^^^^
File "/usr/local/lib/python3.12/site-packages/megatron/core/transformer/transformer_layer.py", line 1217, in __call__
return super().__call__(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/megatron/core/transformer/module.py", line 352, in __call__
return super().__call__(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/megatron/core/transformer/transformer_layer.py", line 513, in forward
hidden_states, context = self._forward_attention(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/megatron/core/transformer/transformer_layer.py", line 597, in _forward_attention
attention_output_with_bias = self.self_attention(
^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/mbridge/models/qwen3_5/attention.py", line 360, in forward
core_attn_out = self._apply_output_gate(core_attn_out, gate)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/megatron/core/transformer/attention.py", line 1221, in _apply_output_gate
gate = gate.view(*x.shape)
^^^^^^^^^^^^^^^^^^^
RuntimeError: shape '[776, 1, 768]' is invalid for input of size 1191936 |
please try install vllm v0.16.1rc0 |
|
@zyfzjsc988 @ISEEKYAN Thank you so much for your previous help. I've successfully run the GRPO training script. However, I encountered an issue with As a temporary workaround, I modified the transformers source code locally: if ignore_keys_at_rope_validation is None:
ignore_keys_at_rope_validation = set()
elif not isinstance(ignore_keys_at_rope_validation, set):
ignore_keys_at_rope_validation = set(ignore_keys_at_rope_validation)After this change, the training ran successfully. I was wondering if you have encountered the same issue, or if there's a more proper fix (e.g., updating to a newer transformers version)? Thanks again for your guidance! |
|
when i use qwen3.5 9b the follow error happen ray.exceptions.RayTaskError(NameError): ray::WorkerDict.ref_init_model() (pid=507880, ip=10.141.0.100, actor_id=1ef0b433cb57b0ae2fef392701000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0x15247d7a7130>) |
I've also successfully run the GRPO training script. modify to set(ignore_keys_at_rope_validation) is a proper fix currently. but I encountered CPU OOM when save_checkpoints. |
What does this PR do?
thanks to @LiuXTao 's great work on ISEEKYAN/mbridge#83, the mbridge has supported qwen3.5.
This PR succeeded in running qwen3.5 SFT on verl based on mbridge supports for qwen3.5
Checklist Before Starting
[{modules}] {type}: {description}(This will be checked by the CI){modules}includefsdp,megatron,veomni,sglang,vllm,rollout,trainer,ci,training_utils,recipe,hardware,deployment,ray,worker,single_controller,misc,perf,model,algo,env,tool,ckpt,doc,data,cfg,reward,like[megatron, fsdp, doc]{type}is infeat,fix,refactor,chore,test[BREAKING]to the beginning of the title.[BREAKING][fsdp, megatron] feat: dynamic batchingTest
see
examples/sft/gsm8k/run_qwen3_5_megatron.shand
examples/grpo_trainer/run_qwen3_5-35b-megatron.shAPI and Usage Example
# Add code snippet or script demonstrating how to use thisDesign & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=alwaysci-requestchannel in theverlSlack workspace. (If not accessible, please try the Feishu group (飞书群).)recipesubmodule, please also update the reference to the submodule commit viagit submodule update --remoteorcd recipe && git pull origin main.